
refactor: consolidate msm and batch-msm implementations to one function #342

Merged: 2 commits merged into dev, Jan 18, 2024

Conversation

@yshekel (Collaborator) commented Jan 16, 2024

refactor: consolidate msm and batch-msm implementations to one function

  • now batch-msm supports parallel BM accumulation in addition to large triangle accumulation

@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch 2 times, most recently from 2234b3f to b6fb5c0 on January 16, 2024 16:03
@yshekel yshekel marked this pull request as ready for review January 16, 2024 16:03
@yshekel yshekel marked this pull request as draft January 16, 2024 16:14
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from b6fb5c0 to 4581424 on January 17, 2024 11:31
    - now batch-msm support parallel BM accumulation in addition to large triangle accumulation
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from 4581424 to 946eb10 on January 17, 2024 14:07
@yshekel yshekel marked this pull request as ready for review January 17, 2024 14:10
@HadarIngonyama (Contributor) left a comment:

Looks good to me apart from the comments. In addition, which tests and benches did you run? We need to make sure that both batched and non-batched MSMs pass and that performance was not affected.

S* scalars,
A* points,
int size,
unsigned batch_size,
Contributor:

The definition of these 3 variables is unclear; please add a description. Also, it seems that only 2 out of the 3 are needed, no?

Collaborator Author:

I will add comments and rename variables to make the code clearer.

Collaborator Author @yshekel, Jan 18, 2024:

I don't think 2 variables are enough:

  • msm_size is the number of elements per MSM sum
  • batch_size is how many MSMs to compute in the batch
  • nof_points can either be msm_size (if sharing points) or msm_size*batch_size (if not sharing points).

Maybe I can replace nof_points with a boolean that tells whether points are shared between batch elements or not. It would still be 3 variables and would require modifying 'MSMConfig', which would break the API, so I prefer avoiding it.

Note that nof_scalars is always assumed to be msm_size * batch_size. Theoretically MSMs could share scalars but not points, but I don't think that's an interesting case, and again I want to stay fully compatible with the current API.
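
For clarity, here is a minimal sketch of the size relations described above (illustrative only; the helper functions and the points_shared flag are not part of MSMConfig or this PR):

```cpp
// Illustrative only: how the sizes discussed above relate to each other.
// msm_size      - number of elements per MSM sum
// batch_size    - number of MSMs computed in the batch
// points_shared - hypothetical flag: are points shared across batch elements?

unsigned expected_nof_scalars(unsigned msm_size, unsigned batch_size)
{
  return msm_size * batch_size; // scalars are never shared in the current API
}

unsigned expected_nof_points(unsigned msm_size, unsigned batch_size, bool points_shared)
{
  return points_shared ? msm_size : msm_size * batch_size;
}
```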

Contributor @DmytroTym, Jan 18, 2024:

I would leave three variables. By design it should be possible to have different points in the first and second MSMs, then re-use points from the first for the third, points from the second for the fourth, and so on...
In this case nof_points = 2 * msm_size.
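
A hypothetical illustration of that layout (not existing code): with nof_points = 2 * msm_size, batch element j could read its points from point set j % 2:

```cpp
// Hypothetical indexing for the point-reuse pattern described above:
// batch elements 0, 2, 4, ... read the first point set,
// batch elements 1, 3, 5, ... read the second point set.
unsigned point_index(unsigned j /* batch element */, unsigned i /* index within the MSM */,
                     unsigned msm_size)
{
  return (j % 2) * msm_size + i; // assumes nof_points == 2 * msm_size
}
```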

@@ -530,7 +542,7 @@ namespace msm {
CHK_IF_RETURN(cudaFreeAsync(sort_single_temp_storage, stream));

// find large buckets
unsigned avarage_size = size / (1 << c);
unsigned avarage_size = msm_size / (1 << c);
Contributor:

making sure that this is the size of a single msm

Collaborator Author:

yes it is. I will rename variables to make the code clearer.

if (is_big_triangle) {
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms, stream));
if (is_big_triangle || c == 1) {
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms * batch_size, stream));
Contributor:

this seems to be used a lot, maybe define total_nof_bms = nof_bms*batch_size

Collaborator Author:

ok

nof_bms = bitsize;
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms, stream));
// Note: the reduction ends up with 'target_windows_count' windows per batch element. for batch=1 we can
// only sum the first 'bitsize' elements to avoid a few ECADD which add zeros
Contributor:

why isn't this true for every batch size?

Collaborator Author:

It is true for every batch size.
The thing is that for fields like BLS12-377, where the scalar field is 253b, we end up with >253 bms (e.g. 254 bms for c=2 with 1 iteration of reduction from 127 to 254 bms; for other cases it can be a little more than that), where the most significant ones are guaranteed to be zero.
The original code summed the bms based on bitsize, so it effectively summed only the non-zero bms. That is slightly more efficient than summing all bms (254 in the example I gave). I wrote the code this way for batch=1 since I did not want to end up with worse performance (although the difference may be negligible).

Generally, for batch>1 we must account for all bms so that the other batch elements sum the right bms. If we set bm=253, the second batch element sums the wrong bms and all results end up wrong except for the first batch element.

Do you think I should always set bm based on the final number of windows? It is slightly less efficient but more drastic.

Contributor:

Why not have each MSM sum only the non-zero buckets (and make sure that the buckets are the correct ones)? I'm pretty sure you can use bm=253 and correct the kernel to sum the right buckets. Let's go for best performance.

Collaborator Author:

Yes I think you are right. I will do that.

final_accumulation_kernel<P, S>
<<<1, 1, 0, stream>>>(final_results, is_result_on_device ? final_result : d_final_result, 1, nof_bms, c);
// launch the double and add kernel, a single thread per batch element
NUM_THREADS = 1;
Contributor:

Better to use at least 32 threads (one full warp).

Collaborator Author:

ok

config.are_scalars_montgomery_form, config.are_points_on_device, config.are_points_montgomery_form,
config.are_results_on_device, config.is_async, stream));

unsigned c = config.batch_size > 1 ? ((config.c == 0) ? get_optimal_c(msm_size) : config.c) : 16;
Contributor:

why is 16 the default for a single msm? it should also be the closest power of 2 to the optimal c

Collaborator Author:

It was like that for batch=1, so I did not want to end up with different behavior.

Should I use the closest power of 2 instead? Closest from below? I mean, does 12 go to 8 or to 16?

Contributor:

I don't remember.. use the value that gives better performance:)

Collaborator Author:

I think the optimum is log(N) (meaning the number of ECADD operations is minimized), but implementation-wise powers of two work better from what I read. In addition, memory grows exponentially in c since you have 2^c buckets per BM, so practically c=16 is the largest power of two that is reasonable.

Having said that, the code says:
unsigned get_optimal_c(int bitsize) { return max((unsigned)ceil(log2(bitsize)) - 4, 1U); }

I don't understand why the '-4'. It is quite odd since I then round it back to a power of two.

Anyway, I will try to play with it a little and figure out which values work better, but I think this is an optimization we can dedicate a separate PR and task to, since it's a heuristic and not really part of this PR.
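
For concreteness, a standalone sketch of the heuristic quoted above with a couple of worked values (the parameter is named bitsize in the quoted code but receives msm_size at the call site shown earlier; the "rounded down" notes assume the power-of-two rounding discussed later in this thread; these are not benchmarks):

```cpp
#include <algorithm>
#include <cmath>

// The heuristic quoted above: c ~= ceil(log2(msm_size)) - 4, clamped to at least 1.
unsigned get_optimal_c(int msm_size)
{
  return std::max((unsigned)ceil(log2(msm_size)) - 4, 1U);
}

// Worked values:
//   msm_size = 1 << 20 -> ceil(log2) = 20 -> c = 16 (already a power of two)
//   msm_size = 1 << 16 -> ceil(log2) = 16 -> c = 12 (rounded down to 8)
//   msm_size = 1000    -> ceil(log2) = 10 -> c = 6  (rounded down to 4)
```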

Contributor:

Yes, for now just change 16 to get_optimal_c (which will then be rounded to the closest power of 2).
The logn-4 is the value that minimizes the number of EC additions for most sizes, but it's true that with the actual implementation there are other considerations too. Ideally we will have a dictionary.

Contributor @DmytroTym, Jan 18, 2024:

We discussed in the daily that fine-tuning the value of c is probably out of scope for this PR.
IMO it'll be more appropriate to tabulate the best values of c and the optimal accumulation method (triangle vs. non-triangle) once we establish proper benchmarking and run out of options to significantly speed up the MSM (potentially altering choices that we hard-coded). I think for now it's ok to leave reasonable defaults and give users an option to edit them via config.

Collaborator Author:

I will leave it as is for now then, meaning the behavior remains as it is today.

@yshekel (Collaborator Author) commented Jan 17, 2024:

> Looks good to me apart from the comments. In addition, which tests and benches did you run? We need to make sure that both batched and non-batched MSMs pass and that performance was not affected.

Thanks for the comments, I will fix and push an additional commit.

The PR tests all the curves, 15 tests each, including 3 MSM tests: msm, batch-msm, skewed-msm. They all pass. It's the "Test / Test Rust on Linux" check.

Regarding performance: @DmytroTym (please correct me if wrong) profiled batched MSM with this PR: (a) batch=16, N=2^16 and (b) batch=3, N=1000. For case (a) he measured ~60.7 ms compared to ~413.8 ms for the old implementation that used the triangle accumulation. For case (b) he measured ~12.7 ms compared to the old ~8.9 ms. In that small case the triangle accumulation is slightly better, but of course it is configurable, and maybe a better choice of c can also improve the performance.

I think it would be very useful to have a benchmark to measure such modifications, but AFAIK there isn't one. Is that right?

@HadarIngonyama (Contributor):

I believe there is a "benches" folder, isn't that so?
Also, best to have the MSM use the "big triangle" automatically only for the parameters where it improves performance - have that as the default in the code.
Another check that needs to be done is to verify that the performance of a single MSM (batch_size=1) has not degraded as a result of the new changes.

@yshekel (Collaborator Author) commented Jan 17, 2024:

> I believe there is a "benches" folder, isn't that so? Also, best to have the MSM use the "big triangle" automatically only for the parameters where it improves performance - have that as the default in the code. Another check that needs to be done is to verify that the performance of a single MSM (batch_size=1) has not degraded as a result of the new changes.

I did not see a 'benches' directory. Maybe it was removed.
I will try to see where big-triangle is more efficient in terms of N and batch_size (or maybe N*batch_size) and default to that.
Regarding batch=1 performance, I will verify it did not degrade, but this is basically the exact same implementation, so other than multiplying by 1 in a few places (on the host) I don't think it should differ at all.

@DmytroTym (Contributor) left a comment:

Looks good with just one change request related to rounding c to a power of 2.
@HadarIngonyama, in v1.0.0 we removed benches for now (but we should bring them back ASAP). I did indeed profile some batched MSMs and saw improvements (mostly); the perf of a single MSM is unchanged.

// reduce c to closest power of two if not using big_triangle reduction logic
// TODO: support arbitrary values of c
if (!config.is_big_triangle) {
while ((c & (c - 1)) != 0)
Contributor:

Can't this procedure just hang for some values of c? Like, if c=b101, then c & (c - 1) doesn't add any extra bits, so the process never terminates. Also, since the idea is rounding up, I'm afraid any MSM of size bigger than 2^20 will end up with c=32, which takes way too much memory.

Collaborator Author:

It will never hang, and it rounds down, not up:

  • for powers of two exactly one bit is 1'b1, meaning 'c & (c-1)' is zero. For example, for c=4'b0100, c-1 = 4'b0011.
  • otherwise each iteration masks one bit to 1'b0, making the number smaller, so it must terminate in at most 31 iterations.
  • note that masking c with any number cannot result in a number larger than c, meaning it rounds down.

In your example c=b101, therefore c-1=b100 --> c & (c-1) = b100. Note that the lowest set bit is masked to 1'b0, and the next iteration would mask the next lowest set bit.
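
For reference, a standalone sketch of that rounding loop (the loop body is not visible in the diff above, so the c &= (c - 1) step is inferred from this explanation):

```cpp
#include <cassert>

// Round c down to the nearest power of two by repeatedly clearing the
// lowest set bit until only one bit remains (assumes c > 0).
unsigned round_down_to_pow2(unsigned c)
{
  while ((c & (c - 1)) != 0) // non-zero as long as more than one bit is set
    c &= (c - 1);            // clear the lowest set bit
  return c;
}

int main()
{
  assert(round_down_to_pow2(5) == 4);   // 0b101  -> 0b100
  assert(round_down_to_pow2(12) == 8);  // 0b1100 -> 0b1000
  assert(round_down_to_pow2(16) == 16); // already a power of two
  return 0;
}
```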

Contributor:

oh sorry you're right

@@ -338,59 +340,63 @@ namespace msm {
// this function computes msm using the bucket method
template <typename S, typename P, typename A>
cudaError_t bucket_method_msm(
int bitsize,
int c,
unsigned bitsize,
Contributor:

Just a style-related note: the Google style guide advises basically using int everywhere. I know that we're still not fully following this rule, but we at least tried to do it in higher-level and public functions. Do you think it's a good idea to stick to it?

Contributor:

I know someone who is very experienced with CUDA and he uses uint32_t wherever he can:)

Contributor:

Well, limbs obviously shouldn't be int. I'll just quote the Google guide to be more precise:

> We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things.
> ...
> You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.

Collaborator Author @yshekel, Jan 18, 2024:

We discussed it in the daily meeting.
I personally prefer unsigned values when the value should not be negative, because it tells the reader that the value should not be negative (and actually uint32_t, as Hadar mentioned, since it's more portable and avoids nasty bugs: sizeof(int) is not guaranteed to be 4 on all systems).

Two notes:
(1) Google style suggests asserting that values are not negative instead of using unsigned int. I am not sure that is applicable to kernel code.
(2) when using std containers like std::vector, the size() method returns an unsigned value, so the compiler will warn about the comparison if the loop counter is signed (see the example below). Iterators are probably the better design pattern in those cases, but that is not really the point here.

Anyway, unless you object, I will leave it as is for now. Note that these are not exposed APIs but internal implementation.
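
To illustrate note (2), here is a small standalone example (not from the PR); with warnings enabled (e.g. -Wall) compilers typically flag the first loop:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  std::vector<int> v = {1, 2, 3};

  // v.size() returns an unsigned type (std::size_t); comparing it against a
  // signed loop counter typically triggers a signed/unsigned comparison warning.
  for (int i = 0; i < v.size(); i++)
    std::printf("%d\n", v[i]);

  // An unsigned counter (or iterators / range-for) avoids the warning.
  for (std::size_t i = 0; i < v.size(); i++)
    std::printf("%d\n", v[i]);

  return 0;
}
```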

@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch 2 times, most recently from 0260d6e to ae37962 on January 18, 2024 11:15
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from ae37962 to 01ea7b6 on January 18, 2024 11:37
@yshekel yshekel merged commit 56fcd2e into dev Jan 18, 2024
11 of 12 checks passed
@yshekel yshekel deleted the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch January 18, 2024 12:20